123VCF: an intuitive and efficient tool for filtering VCF files

Background: The advent of next-generation sequencing (NGS) has brought a paradigm shift in medical genetics by enabling the identification of disease-associated variants. However, the large volume of data produced by NGS requires a robust and dependable mechanism for filtering out irrelevant variants. Annotation-based variant filtering, a pivotal step in this process, demands a sound understanding of the case-specific conditions and the relevant annotation tools. To address this task, we sought to design a variant filtering tool that is accessible, efficient and, more importantly, easy to understand.
Results: We developed 123VCF, a tool capable of processing both compressed and uncompressed Variant Call Format (VCF) files. Built in Java, the tool employs a disk-streaming real-time filtering algorithm, allowing it to handle large variant files on conventional desktop computers. 123VCF filters input variants according to a predefined sequence of filters. Users can define various filtering parameters, such as quality, coverage depth, and variant frequency within populations. Additionally, 123VCF accommodates user-defined filters tailored to specific case requirements, giving users greater control over the filtering process. We evaluated the performance of 123VCF by analyzing different types of variant files and comparing its runtimes to those of the most similar tools, BCFtools filter and GATK VariantFiltration. The results indicated that 123VCF performs relatively well. The tool's intuitive interface and potential for reproducibility make it a valuable asset for both researchers and clinicians.
Conclusion: The 123VCF filtering tool provides an effective, dependable approach for filtering variants in both research and clinical settings. As an open-source tool available at https://project123vcf.sourceforge.io, it is accessible to the global scientific and clinical community, supporting the discovery of disease-causing variants and the advancement of personalized medicine.


Background
The advent of next-generation sequencing (NGS) technologies has revolutionized the field of genomics, enabling the analysis of large-scale genomic datasets with unprecedented accuracy and resolution. However, the sheer volume of data generated by NGS requires efficient and reliable tools for variant analysis. This analysis typically involves the identification of disease-causing variants by filtering out irrelevant variants using annotation-based filtering, a critical step in the analysis pipeline that requires an understanding of both the case's conditions and available annotations [1,2].
This study aimed to develop 123VCF, a user-friendly and efficient GUI-based filtering tool that enables researchers and clinicians to define filters easily through a text file. 123VCF employs a disk-streaming real-time filtering algorithm, efficiently processing variant files without the need to load them into the computer's memory.

Implementation
Effective variant filtering is a pivotal stage in NGS data analysis, involving variant annotation and subsequent filtering based on user-defined criteria. However, traditional variant filtering tools are often memory-intensive, especially when dealing with extensive datasets, as they load the entire input VCF file into memory before applying filters [13]. To address this challenge, we introduce 123VCF, a tool that employs a memory-efficient algorithm for variant filtering, eliminating the need to load the input VCF file into memory. This approach not only speeds up processing but also allows large datasets to be handled on ordinary hardware.
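The streaming idea can be illustrated with a short sketch. 123VCF itself is written in Java; the Python code below is only an illustration under stated assumptions, not the authors' implementation. The function names (`open_vcf`, `stream_filter`, `min_qual`) and the QUAL-based predicate are hypothetical and chosen for this example. The point it shows is that each record is read, tested, and written in turn, so memory use does not grow with file size.

```python
import gzip
from typing import Callable, List, TextIO


def open_vcf(path: str) -> TextIO:
    """Open a plain or gzip-compressed VCF file as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def stream_filter(src_path: str, dst_path: str,
                  keep: Callable[[List[str]], bool]) -> int:
    """Copy header lines and every record for which keep(fields) is true.

    Only the current line is held in memory. Returns the number of
    records written to dst_path.
    """
    kept = 0
    with open_vcf(src_path) as src, open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.startswith("#"):
                dst.write(line)          # header and column lines pass through
                continue
            fields = line.rstrip("\n").split("\t")
            if keep(fields):
                dst.write(line)
                kept += 1
    return kept


def min_qual(threshold: float) -> Callable[[List[str]], bool]:
    """Hypothetical predicate: keep records with QUAL >= threshold.

    Records with a missing QUAL value (".") are rejected; this is an
    assumption made for the example.
    """
    def predicate(fields: List[str]) -> bool:
        qual = fields[5]                 # QUAL is the sixth VCF column
        return qual != "." and float(qual) >= threshold
    return predicate
```

A call such as `stream_filter("input.vcf.gz", "filtered.vcf", min_qual(30))` would process a compressed file of any size with roughly constant memory, since the predicate only ever sees one record at a time.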
123VCF is a freely available, versatile, and cross-platform tool developed using Java Swing, and it is distributed under the MIT license. The tool provides a graphical interface that enables users to filter VCF files based on annotations within the "INFO" and "FORMAT" fields. Additionally, researchers can isolate de novo variants in multi-sample VCF files by specifying genotypes for each sample. To ensure simplicity and independence from third-party code, all components of 123VCF were developed entirely by the authors, resulting in a straightforward and lightweight tool.
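As an illustration of genotype-based selection in a multi-sample file, the following Python sketch checks whether each named sample carries a requested genotype class. It is not taken from 123VCF: the function names, the class labels (`hom_ref`, `het`, `hom_alt`, `missing`), and the sample names in the usage note are assumptions made for this example. It relies only on the standard VCF layout, in which column 9 holds the FORMAT keys and the sample columns follow.

```python
from typing import Dict, List


def genotype_class(sample_field: str, format_keys: List[str]) -> str:
    """Classify a sample's GT value as hom_ref, het, hom_alt, or missing."""
    values = dict(zip(format_keys, sample_field.split(":")))
    gt = values.get("GT", ".")
    alleles = gt.replace("|", "/").split("/")   # treat phased and unphased alike
    if "." in alleles:
        return "missing"
    if all(allele == "0" for allele in alleles):
        return "hom_ref"
    if len(set(alleles)) == 1:
        return "hom_alt"
    return "het"                                 # includes calls such as 1/2


def matches_genotypes(fields: List[str], sample_names: List[str],
                      required: Dict[str, str]) -> bool:
    """True if every sample named in `required` has the requested class.

    `fields` is one tab-split VCF record; `sample_names` lists the sample
    columns in header order.
    """
    format_keys = fields[8].split(":")
    for name, wanted in required.items():
        sample_field = fields[9 + sample_names.index(name)]
        if genotype_class(sample_field, format_keys) != wanted:
            return False
    return True
```

For a trio, a candidate de novo variant could be requested with `{"proband": "het", "father": "hom_ref", "mother": "hom_ref"}`, assuming the samples carry those names in the VCF header.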
The filtering process begins by checking the filtering order file against the header section of the submitted VCF file. Subsequently, each filter is applied to every variant using regular-expression rules tailored to string-based and numerical filters. Only those variants that meet all specified criteria, in terms of both string matching and numerical operations, are written to the designated output file(s). The core concept of the underlying algorithm is shown in Fig. 1, which represents the methodology employed by 123VCF for efficient variant filtering. With its ease of use and powerful filtering capabilities, 123VCF is a valuable tool for researchers and bioinformaticians in diverse genomic analyses.
Fig. 1 123VCF algorithm's steps
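The two stages described above, a header check followed by per-variant filtering, can be sketched in Python as follows. This is a minimal sketch, not the 123VCF implementation: the filter representation (dictionaries with `key`, `kind`, `op`, `value`, and `pattern`) is hypothetical and does not reflect the tool's actual filtering order file format, which is documented in its user manual. The choice to reject variants whose annotation is missing or non-numeric is likewise an assumption.

```python
import operator
import re

# Matches the ID of an INFO declaration such as ##INFO=<ID=DP,Number=1,...>
INFO_ID = re.compile(r"^##INFO=<ID=([^,>]+)")
NUMERIC_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
               "<=": operator.le, "==": operator.eq}


def declared_info_ids(header_lines):
    """Collect the INFO keys declared in the VCF header."""
    ids = set()
    for line in header_lines:
        match = INFO_ID.match(line)
        if match:
            ids.add(match.group(1))
    return ids


def undeclared_keys(filters, header_lines):
    """Return filter keys that the header does not declare."""
    declared = declared_info_ids(header_lines)
    return [f["key"] for f in filters if f["key"] not in declared]


def parse_info(info):
    """Split an INFO column into a dict; flag entries map to True."""
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def passes(info, filters):
    """True only if the variant satisfies every filter (logical AND)."""
    values = parse_info(info)
    for f in filters:
        raw = values.get(f["key"])
        if raw is None or raw is True:
            return False                     # assumption: missing value fails
        if f["kind"] == "numeric":
            try:
                number = float(raw)
            except ValueError:
                return False                 # e.g. "." placeholder annotations
            if not NUMERIC_OPS[f["op"]](number, f["value"]):
                return False
        elif re.search(f["pattern"], raw) is None:
            return False
    return True
```

Running `undeclared_keys` before any variant is read lets a misspelled annotation name be reported immediately, rather than silently rejecting every variant.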
123VCF offers users the flexibility to include or exclude heterozygous and homozygous variants from the sample, allowing for precise and customized filtering. The tool can generate a Tab-Separated Values (TSV) file containing all passed variants, which can be easily imported into spreadsheet-based programs for further analysis. Additionally, 123VCF can generate another TSV file specifically for variants that overlap with a user-provided BED file, allowing researchers and clinicians to identify possible compound heterozygous variants. These TSV files provide a convenient and customizable way to prioritize and analyze variants of interest. The efficiency of 123VCF was evaluated using a set of variant files and compared to that of the most similar algorithms, demonstrating its ability to handle large datasets without compromising performance. The tool's disk-streaming real-time filtering algorithm was found to be efficient, providing accurate filtering results in a short amount of time.
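The BED-overlap output can be sketched in Python as below. This is an illustrative sketch under assumptions, not the tool's code: the function names and the TSV column layout are hypothetical, and a simple linear scan over regions is used for clarity. The one detail worth showing is the coordinate conversion, since BED intervals are 0-based and half-open while VCF positions are 1-based.

```python
import csv


def parse_bed(lines):
    """Return {chrom: [(start, end, name), ...]} from tab-separated BED lines."""
    regions = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        cols = line.rstrip("\n").split("\t")
        name = cols[3] if len(cols) > 3 else ""
        regions.setdefault(cols[0], []).append((int(cols[1]), int(cols[2]), name))
    return regions


def overlapping_regions(regions, chrom, pos, ref):
    """Names of BED regions overlapping a VCF record (POS is 1-based)."""
    start = pos - 1             # convert to 0-based
    end = start + len(ref)      # half-open end covering the REF allele
    return [name for (s, e, name) in regions.get(chrom, [])
            if start < e and s < end]


def write_overlap_tsv(records, regions, out):
    """Write one TSV row per (variant, overlapping region) pair.

    `records` yields (chrom, pos, ref, alt) tuples; `out` is a writable
    text stream. Returns the number of rows written, excluding the header.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["CHROM", "POS", "REF", "ALT", "REGION"])
    written = 0
    for chrom, pos, ref, alt in records:
        for name in overlapping_regions(regions, chrom, pos, ref):
            writer.writerow([chrom, pos, ref, alt, name])
            written += 1
    return written
```

If the BED file lists gene intervals, sorting the resulting TSV by the REGION column in a spreadsheet groups variants by gene, which is one way to spot genes carrying two or more heterozygous variants.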
123VCF provides a robust functionality that allows users to define and apply custom filtration orders using plain text files, as outlined in the user manual. This feature offers a high level of convenience, enabling users to utilize their laboratory-specific filters repeatedly without limitations. By incorporating this feature, users can streamline their workflow and enhance reproducibility, ultimately improving the efficiency and accuracy of their analysis. Furthermore, to facilitate the use of this feature, we have provided several filtering order files along with the tool, providing users with a starting point for customizing their own filtering orders.

Results
To demonstrate the efficacy of 123VCF, a thorough benchmark analysis was conducted using a diverse collection of VCF files from prominent projects [10,15-17]. To ensure consistency in annotations, ANNOVAR with identical databases was employed for all six VCF files [5]. The benchmark comprised VCF files with varying numbers of variants and samples, and the condensed results are presented in Table 2, providing information on variant and sample counts, annotated VCF file sizes, applied filters, and run time of 123VCF, BCFtools filter and GATK VariantFiltration in seconds.
Table 2 shows that 123VCF is a fast and effective filtering tool capable of processing large VCF files within seconds. The algorithm filtered variants in large VCF files precisely while maintaining its performance, making it a useful tool for variant analysis by researchers and clinicians. It should be noted that 123VCF adopts a distinct filtering strategy compared to other available tools, making direct comparisons challenging. Nevertheless, our benchmark analysis demonstrates that 123VCF is efficient, particularly when multiple impactful filters are employed. In this benchmark, we chose to compare 123VCF with the most similar algorithms, BCFtools filter and GATK VariantFiltration. The runtimes of these tools are included in the rightmost columns of Table 2. Identical uncompressed, non-indexed VCF files were used for all tools in this benchmark.
A notable factor affecting 123VCF's performance is the I/O speed of the storage device. Using solid-state drives (SSDs) can significantly enhance its efficiency. To optimize runtimes, we introduced an option to remove filtered-out variants from the output files, as organizing variants in the output files was identified as the most time-intensive operation in our algorithm. Additionally, 123VCF's ability to handle varying file sizes with little impact on performance makes it a useful resource for researchers dealing with different scales of data in NGS data analysis.

Conclusion
In conclusion, the development of 123VCF has yielded a highly efficient VCF file filtering tool with notable advantages over existing filtering tools. The tool's versatility in allowing users to define filters based on any desired annotation, and its filtering algorithm contribute to its efficacy in genetic analysis.
Another significant advantage of 123VCF is its standalone architecture, which allows users to run the tool on a local computer without requiring an internet connection. This ensures the privacy of submitted information, making it a highly secure tool for genetic analysis.
In addition, we added a command line interface to 123VCF to make it even more user-friendly and reproducible. This will allow users to easily automate their analyses and integrate 123VCF into their existing workflows. We believe that this new feature will further increase the accessibility of 123VCF and streamline the analysis process. Our team is dedicated to providing the best possible user experience, and we are excited to continue innovating and improving the tool in the future.